This R Markdown document contains the code used to conduct the musical analysis between two psychedelic compounds, Ayahuasca and LSD. It consists of three parts: 1. Data preprocessing 2. Exploratory data visualisations 3. Building two classifiers to test how well musical features can differentiate tracks between the two psychedelic cultures.

The script setup:

Data Preprocessing

  1. Filter the data to include only tracks from LSD or Ayahuasca
  2. Calculate a within compound weighted index
  3. Remove duplicates within compound groups
  4. Select the top 7000 tracks from each drug group
  5. Merge the dataframes back into one to start the analysis
#Packages needed 
#install.packages(c("dplyr", "tidyverse", "pacman"))
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.3
## ✓ tibble  3.0.4     ✓ stringr 1.4.0
## ✓ tidyr   1.0.2     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(pacman)

#Step One: Filter the data
#The data has already been grouped into Drugs from the keyword searches (column = "Drug") so we can subset using this column
data_Ayahuasca  <- CDS_data %>% subset(Drug == "Ayahuasca")
data_LSD <- CDS_data %>% subset(Drug == "LSD")

#Step Two: Calculate the within Drug Weighted Index (TF-IDF equation)
#i) create a drug duplicates column, ii) Calculate new TF term, iii) calculate the new index 
data_Ayahuasca <- data_Ayahuasca %>% group_by(TrackName, Artists) %>% mutate(Drug_Dups = n()) #i 
data_Ayahuasca <- data_Ayahuasca %>% group_by(TrackName, Artists) %>% mutate(Drug_TF = sum(FollowCount / NumOfTracks)) # ii 
data_Ayahuasca <- data_Ayahuasca %>% mutate(Drug_Index = Drug_TF *log10(1 + Inverse)) # iii 

data_LSD <- data_LSD %>% group_by(TrackName, Artists) %>% mutate(Drug_Dups = n()) #i
data_LSD <- data_LSD %>% group_by(TrackName, Artists) %>% mutate(Drug_TF = sum(FollowCount / NumOfTracks))#ii
data_LSD <- data_LSD %>% mutate(Drug_Index = Drug_TF *log10(1 + Inverse)) #iii

#Now we have each compound's data in a data frame with a column for duplicates within compound (Drug_Dups), a column for within-compound TF (Drug_TF), and a new within-compound index (Drug_Index)
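To make the index calculation concrete, here is a minimal sketch on invented toy data (the FollowCount, NumOfTracks and Inverse values are hypothetical; Inverse stands in for the inverse-frequency term already present in CDS_data):

```r
library(dplyr)

# Hypothetical toy data: "Track A" appears in two playlists, "Track B" in one
toy <- data.frame(
  TrackName   = c("Track A", "Track A", "Track B"),
  Artists     = c("X", "X", "Y"),
  FollowCount = c(100, 50, 200),  # followers of the playlist the track came from
  NumOfTracks = c(20, 10, 40),    # length of that playlist
  Inverse     = c(3, 3, 3)        # assumed inverse-frequency term from the keyword search
)

toy <- toy %>%
  group_by(TrackName, Artists) %>%
  mutate(Drug_Dups = n(),                                # i) duplicates within compound
         Drug_TF   = sum(FollowCount / NumOfTracks)) %>% # ii) TF summed over duplicates
  ungroup() %>%
  mutate(Drug_Index = Drug_TF * log10(1 + Inverse))      # iii) the weighted index

# Track A: TF = 100/20 + 50/10 = 10, so Drug_Index = 10 * log10(4), roughly 6.02
```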

#Step Three: Remove the duplicates within the compound groups 
data_LSD <- data_LSD %>% distinct(TrackName, Artists, .keep_all = TRUE) #removes 9575 tracks (23%)
data_Ayahuasca <- data_Ayahuasca %>% distinct(TrackName, Artists, .keep_all = TRUE) #removes 7743 tracks (52%)

#Step Four: Select the top 7000 tracks from each compound group, ranked by new Drug_index weight 
top_7k_LSD <- as.data.frame(data_LSD) %>% slice_max(Drug_Index, n = 7000, with_ties = F)
top_7k_Aya <- as.data.frame(data_Ayahuasca) %>% slice_max(Drug_Index, n = 7000, with_ties = F)

#Step Five: Merge the dataframes back together to start the data analysis 
CDS_data <- rbind.data.frame(top_7k_Aya, top_7k_LSD) #We now have a dataframe with 14,000 tracks 

Part One: Exploratory Data Visualisation

Exploratory data analysis was conducted to compare the two compounds. This took the form of: i) checking the correlations, ii) boxplots, iii) density histograms, iv) checking the significance of differences using t-tests, v) scatter plots of interesting variables to see whether the differences are visible

#Packages needed: 
#install.packages("ggplot2", "funModeling")
library(ggplot2)
library(funModeling)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## funModeling v.1.9.4 :)
## Examples and tutorials at livebook.datascienceheroes.com
##  / Now in Spanish: librovivodecienciadedatos.ai
#Getting our data into numerical form 
#Make numerical dataframes with the musical features - one overall and one for each drug group
CDS_numerical <- CDS_data %>% select(c(Drug, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence))

Ayahuasca_numerical <- top_7k_Aya %>% select(c(Drug, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence))

LSD_numerical <- top_7k_LSD %>% select(c(Drug, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence))

#Make the Drug column a factor to allow grouped plotting 
CDS_numerical$Drug <- as.factor(CDS_numerical$Drug)
Ayahuasca_numerical$Drug <- as.factor(Ayahuasca_numerical$Drug)
LSD_numerical$Drug <- as.factor(LSD_numerical$Drug)

#Step One: Correlations 
#Check the correlations among the variables using corrplot() function 
library(corrplot)
## corrplot 0.84 loaded
correlations <- cor(CDS_numerical[, 2:10]) #Correlate the nine musical features (excluding the Drug factor)
corrplot(correlations, method="circle")

##Results: Acousticness has a strong negative correlation with danceability, energy, and loudness. 
#Loudness has a strong positive correlation with danceability and energy
#Danceability has a moderately strong positive correlation with energy

#Step Two: Boxplots 
#Use the funModeling package and function plotar() to make boxplots of all the musical variables: 
plotar(data=CDS_numerical, target="Drug", plot_type="boxplot")
## Warning: `fun.y` is deprecated. Use `fun` instead.

## Results: Big differences seen in acousticness, energy and instrumentalness
#Moderate differences seen in danceability, loudness, tempo and valence 
#Minimal differences seen in liveness and speechiness 

#Step 3: Density Histograms  
#Use the funModeling package and function plotar()to make density histograms to compare the distribution of variables 
plotar(data=CDS_numerical, target="Drug", plot_type="histdens")
## Warning: `summarise_()` is deprecated as of dplyr 0.7.0.
## Please use `summarise()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

##Results: Danceability and tempo have roughly normal and similar distributions between the compound groups
#Acousticness, energy and instrumentalness show large differences in means and distribution 
#Valence shows a similar, moderately right-skewed distribution for both groups (skew = 0.56, kurt = 2.32)
#Liveness has a highly right-skewed but similar distribution (skew = 2.35, kurt = 8.91) for both compounds, while loudness has a left-skewed but similar distribution (skew = -1.27, kurt = 5.20) for both compounds. 


#Step 4: Getting the descriptive statistics and significance of differences between each compound  
profiling_num(CDS_numerical) #Stats of both groups together (overall subset stats)
##           variable         mean     std_dev variation_coef          p_01
## 1     acousticness   0.40467066  0.36864577      0.9109773   0.000024499
## 2     danceability   0.54331271  0.19404900      0.3571589   0.090097000
## 3           energy   0.53406577  0.27290588      0.5109968   0.022500000
## 4 instrumentalness   0.37405771  0.39147732      1.0465693   0.000000000
## 5         liveness   0.18837783  0.16450245      0.8732580   0.046999000
## 6         loudness -10.96507857  5.41746384     -0.4940652 -28.491150000
## 7      speechiness   0.07459249  0.08634546      1.1575624   0.025900000
## 8            tempo 120.88541257 29.43418603      0.2434883  64.655780000
## 9          valence   0.35861748  0.25161787      0.7016330   0.031700000
##           p_05       p_25     p_50      p_75    p_95      p_99    skewness
## 1   0.00024895   0.025900   0.3090   0.78900   0.969   0.99100  0.28657941
## 2   0.18200000   0.412750   0.5680   0.68700   0.826   0.90601 -0.39552102
## 3   0.08479000   0.316000   0.5350   0.76300   0.955   0.99200 -0.07312362
## 4   0.00000000   0.000114   0.1650   0.82000   0.928   0.96300  0.31021774
## 5   0.06560000   0.096275   0.1170   0.21800   0.566   0.87604  2.34931960
## 6 -21.63740000 -13.546000  -9.7355  -7.11975  -4.536  -2.75882 -1.27252291
## 7   0.02840000   0.035200   0.0446   0.06870   0.259   0.44401  3.90848089
## 8  75.37285000  97.231000 122.0185 140.04700 171.846 189.11574  0.14115633
## 9   0.03820000   0.144000   0.3160   0.53500   0.835   0.94701  0.56394702
##    kurtosis       iqr                      range_98           range_80
## 1  1.449082  0.763100           [2.4499e-05, 0.991]  [0.001269, 0.932]
## 2  2.500826  0.274250           [0.090097, 0.90601]     [0.255, 0.782]
## 3  1.932052  0.447000               [0.0225, 0.992]    [0.149, 0.9131]
## 4  1.282778  0.819886                    [0, 0.963]         [0, 0.901]
## 5  8.913906  0.121725 [0.046999, 0.876040000000001]    [0.0767, 0.382]
## 6  5.199263  6.426250         [-28.49115, -2.75882] [-18.332, -5.4179]
## 7 23.274619  0.033500             [0.0259, 0.44401]    [0.0303, 0.148]
## 8  2.657559 42.816000         [64.65578, 189.11574] [81.3009, 159.992]
## 9  2.323957  0.391000             [0.0317, 0.94701]   [0.05519, 0.739]
profiling_num(Ayahuasca_numerical) #Ayahuasca means, sd's, skew and kurtosis 
##           variable        mean     std_dev variation_coef        p_01
## 1     acousticness   0.6394725  0.31199978      0.4879018   0.0020998
## 2     danceability   0.4781168  0.19626932      0.4105050   0.0774990
## 3           energy   0.3695685  0.21920621      0.5931409   0.0138980
## 4 instrumentalness   0.3027125  0.38129281      1.2595873   0.0000000
## 5         liveness   0.1729666  0.15159665      0.8764502   0.0527960
## 6         loudness -13.5944813  5.65375396     -0.4158860 -31.1464200
## 7      speechiness   0.0606477  0.08020739      1.3225134   0.0254000
## 8            tempo 114.5727894 30.76404990      0.2685110  61.3111100
## 9          valence   0.3253003  0.25283776      0.7772442   0.0297000
##         p_05        p_25      p_50       p_75      p_95      p_99   skewness
## 1   0.037595   0.4030000   0.75000   0.905250   0.98200   0.99400 -0.7143363
## 2   0.142000   0.3257500   0.49900   0.626250   0.77805   0.86001 -0.1789691
## 3   0.049795   0.1920000   0.35000   0.522000   0.77100   0.89600  0.4032506
## 4   0.000000   0.0000226   0.02135   0.745000   0.93800   0.97001  0.7021085
## 5   0.069800   0.0957000   0.11200   0.177000   0.47005   0.88700  2.8065340
## 6 -24.446550 -16.6830000 -12.55800  -9.523750  -6.20490  -4.56383 -0.9962732
## 7   0.027700   0.0328000   0.03900   0.051525   0.16005   0.47301  5.7924811
## 8  71.948900  90.0237500 111.82150 134.978500 172.06105 191.37213  0.4390477
## 9   0.036500   0.1100000   0.26500   0.490000   0.83600   0.95201  0.7660715
##    kurtosis        iqr              range_98            range_80
## 1  2.156412  0.5022500    [0.0020998, 0.994]      [0.116, 0.966]
## 2  2.206481  0.3005000   [0.077499, 0.86001]      [0.193, 0.725]
## 3  2.450312  0.3300000     [0.013898, 0.896]    [0.09136, 0.676]
## 4  1.698618  0.7449774          [0, 0.97001]          [0, 0.909]
## 5 11.899438  0.0813000     [0.052796, 0.887]    [0.07879, 0.353]
## 6  4.258405  7.1592500 [-31.14642, -4.56383] [-21.3572, -7.3459]
## 7 44.781118  0.0187250     [0.0254, 0.47301]    [0.0292, 0.0939]
## 8  2.818351 44.9547500 [61.31111, 191.37213] [77.9062, 157.9249]
## 9  2.610993  0.3800000     [0.0297, 0.95201]    [0.0396, 0.7241]
profiling_num(LSD_numerical) #LSD means, sd's, skew and kurtosis 
##           variable         mean     std_dev variation_coef          p_01
## 1     acousticness   0.16986884  0.25336818      1.4915518   0.000013697
## 2     danceability   0.60850867  0.16820083      0.2764148   0.168000000
## 3           energy   0.69856303  0.21630553      0.3096435   0.123990000
## 4 instrumentalness   0.44540293  0.38854205      0.8723383   0.000000000
## 5         liveness   0.20378901  0.17512669      0.8593530   0.042600000
## 6         loudness  -8.33567586  3.59271422     -0.4310046 -20.561010000
## 7      speechiness   0.08853729  0.08994414      1.0158899   0.026400000
## 8            tempo 127.19803571 26.58439062      0.2090000  73.004930000
## 9          valence   0.39193466  0.24593663      0.6274940   0.035200000
##            p_05        p_25      p_50      p_75     p_95      p_99    skewness
## 1   0.000073485   0.0027400   0.03585   0.23000   0.7870   0.95600  1.66377113
## 2   0.296000000   0.5070000   0.62800   0.73000   0.8540   0.92401 -0.50992324
## 3   0.300950000   0.5490000   0.72850   0.88300   0.9770   0.99600 -0.65183275
## 4   0.000000000   0.0015075   0.52050   0.84100   0.9190   0.95000 -0.04923239
## 5   0.061695000   0.0969750   0.12600   0.26225   0.6190   0.87204  2.00809722
## 6 -14.924000000  -9.9250000  -7.71300  -6.03775  -3.8239  -2.17084 -1.60242746
## 7   0.029600000   0.0398000   0.05380   0.08800   0.3010   0.43101  2.71360346
## 8  81.324950000 109.4750000 128.12800 144.02300 170.3813 187.69350 -0.05554600
## 9   0.047295000   0.1820000   0.36200   0.56900   0.8340   0.94400  0.39875011
##    kurtosis        iqr                    range_98            range_80
## 1  4.700480  0.2272600         [1.3697e-05, 0.956]  [0.0002597, 0.599]
## 2  2.946297  0.2230000            [0.168, 0.92401]      [0.373, 0.808]
## 3  2.763896  0.3340000            [0.12399, 0.996]      [0.397, 0.954]
## 4  1.193089  0.8394925                   [0, 0.95]          [0, 0.896]
## 5  7.052634  0.1652750 [0.0426, 0.872040000000001]    [0.07399, 0.417]
## 6  9.228778  3.8872500       [-20.56101, -2.17084] [-12.6613, -4.6639]
## 7 11.826978  0.0482000           [0.0264, 0.43101]     [0.0327, 0.211]
## 8  2.835613 34.5480000        [73.00493, 187.6935]  [89.9908, 160.033]
## 9  2.184401  0.3870000             [0.0352, 0.944]     [0.0782, 0.753]
#Run independent t-tests between the two groups to test whether the differences are statistically significant (or likely to have emerged by chance). Our significance threshold is p < .05

#Significance T-tests 
t.test(CDS_numerical$acousticness ~ CDS_numerical$Drug, var.equal = TRUE) #Acousticness:t = 97.8, p < .05
## 
##  Two Sample t-test
## 
## data:  CDS_numerical$acousticness by CDS_numerical$Drug
## t = 97.756, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.4601874 0.4790198
## sample estimates:
## mean in group Ayahuasca       mean in group LSD 
##               0.6394725               0.1698688
t.test(CDS_numerical$danceability ~ CDS_numerical$Drug, var.equal = TRUE) #Danceability: t = -42.2, p < .05
## 
##  Two Sample t-test
## 
## data:  CDS_numerical$danceability by CDS_numerical$Drug
## t = -42.205, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1364477 -0.1243362
## sample estimates:
## mean in group Ayahuasca       mean in group LSD 
##               0.4781168               0.6085087
t.test(CDS_numerical$energy ~ CDS_numerical$Drug, var.equal = TRUE) #Energy: t = -89.4, p < .05
## 
##  Two Sample t-test
## 
## data:  CDS_numerical$energy by CDS_numerical$Drug
## t = -89.381, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3362094 -0.3217796
## sample estimates:
## mean in group Ayahuasca       mean in group LSD 
##               0.3695685               0.6985630
t.test(CDS_numerical$instrumentalness ~ CDS_numerical$Drug, var.equal = TRUE) #Instru: t = -21.9, p < .05
## 
##  Two Sample t-test
## 
## data:  CDS_numerical$instrumentalness by CDS_numerical$Drug
## t = -21.93, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1554442 -0.1299367
## sample estimates:
## mean in group Ayahuasca       mean in group LSD 
##               0.3027125               0.4454029
t.test(CDS_numerical$liveness ~ CDS_numerical$Drug, var.equal = TRUE) #Live: t = -11.1, p < .05
## 
##  Two Sample t-test
## 
## data:  CDS_numerical$liveness by CDS_numerical$Drug
## t = -11.133, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03624894 -0.02539580
## sample estimates:
## mean in group Ayahuasca       mean in group LSD 
##               0.1729666               0.2037890
t.test(CDS_numerical$loudness ~ CDS_numerical$Drug, var.equal = TRUE) #Loud: t = -65.7, p < .05 
## 
##  Two Sample t-test
## 
## data:  CDS_numerical$loudness by CDS_numerical$Drug
## t = -65.682, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.415743 -5.101868
## sample estimates:
## mean in group Ayahuasca       mean in group LSD 
##              -13.594481               -8.335676
t.test(CDS_numerical$speechiness ~ CDS_numerical$Drug, var.equal = TRUE) #Speech: t = -19.4, p < .05
## 
##  Two Sample t-test
## 
## data:  CDS_numerical$speechiness by CDS_numerical$Drug
## t = -19.362, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03071295 -0.02506622
## sample estimates:
## mean in group Ayahuasca       mean in group LSD 
##              0.06064770              0.08853729
t.test(CDS_numerical$tempo ~ CDS_numerical$Drug, var.equal = TRUE) #Tempo : t = -25.98, p < .05
## 
##  Two Sample t-test
## 
## data:  CDS_numerical$tempo by CDS_numerical$Drug
## t = -25.98, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.57781 -11.67268
## sample estimates:
## mean in group Ayahuasca       mean in group LSD 
##                114.5728                127.1980
t.test(CDS_numerical$valence ~ CDS_numerical$Drug, var.equal = TRUE) #Valence: t = -15.8, p < .05
## 
##  Two Sample t-test
## 
## data:  CDS_numerical$valence by CDS_numerical$Drug
## t = -15.806, df = 13998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07489792 -0.05837080
## sample estimates:
## mean in group Ayahuasca       mean in group LSD 
##               0.3253003               0.3919347
##Results are all summarised in Table X of the final project write up

#Step 5: Scatter Plots
#Scatter 1: Acousticness and energy are highly correlated, with big differences between the two group means. Let's use a scatter plot to observe the relationship between these two variables across the compound groups: 
ggplot(CDS_numerical, aes(x = acousticness, y = energy)) + 
  geom_point(aes(colour = Drug, alpha=0.01)) +
  scale_color_manual(values = c("#CA3C97", "#EDD9A3")) +
  theme_bw() +
    theme(
      plot.title = element_text(size=12)
    ) + theme(panel.border = element_blank()) + 
  labs(x = "Acousticness", y = "Energy", title = "Scatterplot of Acousticness and Energy between Compound Groups") +
    theme(axis.text.x = element_text(face = 'bold', size = 8),
        axis.text.y = element_text(face = 'bold', size = 8))

#Comments: Here the difference between the compound groups is clearly visible: the LSD tracks cluster around the higher end of energy with low acousticness, while the Ayahuasca tracks cluster at the higher end of acousticness and the lower end of energy

#Scatter 2: Valence and Danceability show smaller but still significant differences between the two means. We'll also visualise these using a scatterplot: 
ggplot(CDS_numerical, aes(x = valence, y = danceability)) + 
  geom_point(aes(colour = Drug, alpha=0.01)) +
  scale_color_manual(values = c("#CA3C97", "#EDD9A3")) +
  theme_bw() +
    theme(
      plot.title = element_text(size=12)
    ) + theme(panel.border = element_blank()) + 
  labs(x = "Valence", y = "Danceability", title = "Scatterplot of Valence and Danceability between Compound Groups") +
    theme(axis.text.x = element_text(face = 'bold', size = 8),
        axis.text.y = element_text(face = 'bold', size = 8))

#Comments: The differences here are less visible, but we still see a distinct cluster of LSD tracks at the higher end of both valence and danceability, while the Ayahuasca tracks are spread much more widely, with many clustering at the higher end of danceability but the lower end of valence. 

It’s clear that musical differences exist between the two groups, but we want to go beyond significance testing and see how well a classifier can distinguish between the two groups based only on their musical features. A classifier is typically considered good if it achieves an accuracy above 70%. The higher the accuracy, the better the selected variables distinguish or explain the two groups. In this context, the classifier tells us how well the musical features define each compound group, and thus the musical choices of the culture behind the compound. The higher the accuracy, the more confidence we can have in inferring why the musical features distribute as they do between the two cultures. Let’s jump in!

Part Two: Building a Classifier

Classification is the process of predicting categorical outcomes given input data points. It is essentially an algorithm that compares the input variables and maps each input onto a specific category. Here our categories are the compound groups of Ayahuasca (representing our shamanic culture) and LSD (representing our psychedelia culture), and our input variables are the musical features. We are not especially interested in the extent to which each variable explains the categories; rather, we are testing whether enough difference exists to support accurate classification, which could guide further investigations and support the claim that fundamental differences between the cultures exist.

Two approaches will be used to build a classifier, and the accuracy of each will be compared: i) Binomial Logistic Regression Classification ii) Random Forest Classification

Binomial logistic regression is a fundamental algorithm for supervised machine-learning classification. It works much like linear regression, except that because the outcome is categorical (binary), the data points are fitted along a sigmoid “S”-shaped curve, clustering at the top (1) and bottom (0) of the curve. Binomial logistic regression is a good starting point for exploring the possibility of classification.
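As a quick base-R illustration (all names here are invented), the sigmoid maps the model’s linear predictor onto a 0-1 probability scale:

```r
# The sigmoid (inverse-logit) curve that logistic regression fits points along
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- seq(-6, 6, by = 0.1)
plot(z, sigmoid(z), type = "l",
     xlab = "Linear predictor", ylab = "P(class = 1)")
abline(h = 0.5, lty = 2)  # the 0.5 cut-off later used to assign classes
```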

A random forest classifier works on the concept of the wisdom of the crowd. It trains a number of decision trees on samples of the training set; each tree predicts which class an observation belongs to, and the class prediction with the most votes becomes the model’s prediction. It is one of the most accurate classification methods and is widely popular within machine learning.
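The voting idea can be sketched in a few lines (the votes below are invented for illustration):

```r
# Hypothetical class votes from five decision trees for a single track
tree_votes <- c("LSD", "LSD", "Ayahuasca", "LSD", "Ayahuasca")

# The forest's prediction is the majority class
prediction <- names(which.max(table(tree_votes)))
prediction  # "LSD" wins, 3 votes to 2
```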

The Binomial Logistic Regression Classifier:

set.seed(999) #this allows for replication 

#To keep the logistic regression model parsimonious, the 5 musical variables with the greatest variance between groups were chosen as predictors: acousticness, energy, instrumentalness, danceability, and tempo

#The data 
BLR_Data <- CDS_numerical %>% select(c(acousticness, energy,  instrumentalness, danceability, tempo, Drug))
BLR_Data$Drug <- as.factor(BLR_Data$Drug)

#The functions
#createDataPartition() - splits data into train and test sets based on our target (Drug)
#glm() - one of the most versatile functions for fitting a logistic regression in R 

# Loading caret library - this supports the splitting of data 
require(caret) 
## Loading required package: caret
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
## The following object is masked from 'package:purrr':
## 
##     lift
# Splitting the data into train (70%) and test (30%) sets 
BLR_Model_index <- createDataPartition(BLR_Data$Drug, p = .70, list = FALSE)
BLR_Model_train <- BLR_Data[BLR_Model_index, ]
BLR_Model_test <- BLR_Data[-BLR_Model_index, ]

# Training the model - AIC = 8242 
BLR_Model <- glm(Drug ~ ., family = binomial(), data = BLR_Model_train)
# Checking the model
summary(BLR_Model) #There is a big difference between the Null (13586) and Residual (8230) deviance, indicating that our predictors are in fact telling us more than a null model would 
## 
## Call:
## glm(formula = Drug ~ ., family = binomial(), data = BLR_Model_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.53172  -0.58409   0.06151   0.61796   2.80460  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.0035681  0.1798262 -11.142  < 2e-16 ***
## acousticness     -2.7806421  0.1041778 -26.691  < 2e-16 ***
## energy            3.4142884  0.1544653  22.104  < 2e-16 ***
## instrumentalness  0.8698839  0.0745037  11.676  < 2e-16 ***
## danceability      1.2810164  0.1580537   8.105 5.28e-16 ***
## tempo             0.0018117  0.0009767   1.855   0.0636 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13585.7  on 9799  degrees of freedom
## Residual deviance:  8229.5  on 9794  degrees of freedom
## AIC: 8241.5
## 
## Number of Fisher Scoring iterations: 5
#Convert the Coefficients to odds - to make them more interpretable 
exp(coef(BLR_Model))
##      (Intercept)     acousticness           energy instrumentalness 
##       0.13485325       0.06199869      30.39531204       2.38663386 
##     danceability            tempo 
##       3.60029715       1.00181332
# Creating predictions in the test dataset
pred_BLR <- predict(BLR_Model, BLR_Model_test, type = "response")

#Classification Table - First Training set and then test 
# Converting from probability to actual output
BLR_Model_train$pred_drug <- ifelse(BLR_Model$fitted.values >= 0.5, "LSD", "Ayahuasca")
# Generating the classification table
ctab_train <- table(BLR_Model_train$Drug, BLR_Model_train$pred_drug)
ctab_train
##            
##             Ayahuasca  LSD
##   Ayahuasca      3915  985
##   LSD             831 4069
# Converting from probability to actual output
BLR_Model_test$pred_drug <- ifelse(pred_BLR >= 0.5, "LSD", "Ayahuasca")
# Generating the classification table
ctab_test <- table(BLR_Model_test$Drug, BLR_Model_test$pred_drug)
ctab_test
##            
##             Ayahuasca  LSD
##   Ayahuasca      1655  445
##   LSD             383 1717
#Accuracy = (TP + TN)/(TN + FP + FN + TP)
#Train
accuracy_train <- sum(diag(ctab_train))/sum(ctab_train)*100
accuracy_train   # Accuracy in Training dataset - is 81.5% (81.46)
## [1] 81.46939
# Test
accuracy_test <- sum(diag(ctab_test))/sum(ctab_test)*100
accuracy_test   #Accuracy in Test dataset = Also 80.3%% (80.28)
## [1] 80.28571
#The accuracies are close, indicating that our model is performing well! We are performing around 30 percentage points above chance (50%) 

#Let's explore the True Positive Rate (TPR), i.e. the sensitivity
# Recall, or TPR, indicates how often our model predicts an actual TRUE out of all TRUE events.
Recall_TR <- (ctab_train[2, 2]/sum(ctab_train[2, ]))*100
Recall_TR   #Answer: Recall in Train dataset - This is 83.0% 
## [1] 83.04082
# True Negative Rate (TNR) in Train dataset
#TNR indicates how often our model predicts actual nonevents out of all nonevents
TNR <- (ctab_train[1, 1]/sum(ctab_train[1, ]))*100
TNR   #Answer: This happens 79.9% of the time 
## [1] 79.89796
# Precision in Train dataset (i.e. how often the model's positive predictions are actually correct)
Precision <- (ctab_train[2, 2]/sum(ctab_train[, 2]))*100
Precision   #Answer: This happens 80.5% of the time 
## [1] 80.51049
#Calculating the F-Score - the F-Score is the harmonic mean of recall and precision. Its value lies between 0 and 1: a value of 1 represents perfect precision and recall, and 0 the worst case.
F_Score <- (2 * Precision * Recall_TR / (Precision + Recall_TR))/100
F_Score #This is 0.82 which is pretty good! 
## [1] 0.8175608
#Calculating the AUC value and ROC curve - this gives us an indication of how good our model is 
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_BLR <- roc(BLR_Model_train$Drug, BLR_Model$fitted.values)
## Setting levels: control = Ayahuasca, case = LSD
## Setting direction: controls < cases
auc(roc_BLR) #The area under the curve is .89, which is very good (the value runs from 0 to 1, and the closer to 1 the better the model)
## Area under the curve: 0.8902
plot(roc_BLR) #This plots our ROC curve; it sits well above the diagonal, indicating a good model 

Conclusion: Our binomial logistic regression model can predict which compound group a track belongs to with an accuracy around 80%, using the musical features of acousticness, energy, instrumentalness, danceability and tempo. This is 30 percentage points above chance level (50%). Our train and test sets show similar accuracy, indicating that the model is not over-fitting, and our statistical model evaluations indicate that the classifier has good levels of sensitivity, specificity and AUC. This is promising and supports the idea that the musical differences are strong enough to perhaps tell us something more about the culture they belong to.

Random Forest Classifier: This classifier is designed to handle many predictor variables, so all nine musical variables can be used. The randomForest package in R even calculates how important each variable is for predicting which compound group a track belongs to.
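As a hedged sketch of that importance calculation, using the built-in iris data as a stand-in for the project’s Drug ~ musical-features model:

```r
library(randomForest)
set.seed(42)

# Stand-in for randomForest(Drug ~ ., data = train_RF, importance = TRUE)
rf_toy <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 100)

imp <- importance(rf_toy)  # per-class accuracy drops, MeanDecreaseAccuracy, MeanDecreaseGini
round(imp, 2)
varImpPlot(rf_toy)         # dot-chart ranking the predictors by importance
```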

set.seed(777)
#Load packages needed to build and evaluate the models 
library(randomForest) 
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(e1071) 
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
## 
##     impute
#The data - We can use our CDS_numerical dataset

#Split the data into a training and test set 
RF_split = sample(2, nrow(CDS_numerical), replace=TRUE, prob=c(0.7,0.3))
train_RF = CDS_numerical[RF_split==1,]
test_RF = CDS_numerical[RF_split==2,]


#Tuning the model: define a 10-fold cross-validation control (trControl, used in the maxnodes search below)
#Then run an initial evaluation on the training data with caret's default resampling (25 bootstrap reps) to find the optimal mtry for our model 
trControl <- trainControl(method = "cv",
    number = 10,
    search = "grid")

trEVALUATE <- train(Drug~., train_RF, method = "rf", metric= "Accuracy", trControl = trainControl(), tuneGrid = NULL)
#print results
print(trEVALUATE) # The evaluation tells us the optimal mtry is 2 (Accuracy = .83, Kappa = .66) - the model tested mtry values of 2, 5 and 9. 
## Random Forest 
## 
## 9829 samples
##    9 predictor
##    2 classes: 'Ayahuasca', 'LSD' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 9829, 9829, 9829, 9829, 9829, 9829, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8280488  0.6558644
##   5     0.8252217  0.6502185
##   9     0.8226500  0.6450957
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
#Search for the best maxnodes value 
store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = 2)
for (maxnodes in c(5: 15)) {
    set.seed(1234)
    rf_maxnode <- train(Drug~.,
        data = train_RF,
        method = "rf",
        metric = "Accuracy",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = maxnodes,
        ntree = 300)
    current_iteration <- toString(maxnodes)
    store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry) #Accuracy is highest at around 14-15 nodes, so maxnodes = 14 is used in the next step (Accuracy ≈ .82, Kappa ≈ .63)
## 
## Call:
## summary.resamples(object = results_mtry)
## 
## Models: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 
## Number of resamples: 10 
## 
## Accuracy 
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 5  0.7800407 0.8062055 0.8112920 0.8088281 0.8153611 0.8270600    0
## 6  0.7881874 0.8031536 0.8112920 0.8100497 0.8207019 0.8229908    0
## 7  0.7902240 0.8018820 0.8153611 0.8121862 0.8232452 0.8270600    0
## 8  0.7871690 0.8059512 0.8153611 0.8137118 0.8245168 0.8341811    0
## 9  0.7934893 0.8077314 0.8123093 0.8148319 0.8265514 0.8392675    0
## 10 0.7861507 0.8128179 0.8173957 0.8155428 0.8247711 0.8341811    0
## 11 0.7892057 0.8090031 0.8179044 0.8148310 0.8227365 0.8321465    0
## 12 0.7932790 0.8095117 0.8199390 0.8163574 0.8260427 0.8351984    0
## 13 0.7922607 0.8110376 0.8194303 0.8168659 0.8275687 0.8341811    0
## 14 0.7932790 0.8112920 0.8184130 0.8160522 0.8245168 0.8351984    0
## 15 0.7973523 0.8120549 0.8199390 0.8181889 0.8278230 0.8341811    0
## 
## Kappa 
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 5  0.5598843 0.6121235 0.6223686 0.6174918 0.6305921 0.6542027    0
## 6  0.5761638 0.6061610 0.6225068 0.6199978 0.6412308 0.6461791    0
## 7  0.5800718 0.6036546 0.6305607 0.6242263 0.6464040 0.6542499    0
## 8  0.5739670 0.6116484 0.6305960 0.6271894 0.6486858 0.6683131    0
## 9  0.5868234 0.6151343 0.6244467 0.6294447 0.6528155 0.6785241    0
## 10 0.5720245 0.6253927 0.6345150 0.6308624 0.6493575 0.6682980    0
## 11 0.5780228 0.6176411 0.6355807 0.6294271 0.6452330 0.6642433    0
## 12 0.5860945 0.6187396 0.6397256 0.6324577 0.6517691 0.6703254    0
## 13 0.5840659 0.6218116 0.6386877 0.6334839 0.6548511 0.6684188    0
## 14 0.5861564 0.6222874 0.6366604 0.6318554 0.6487657 0.6704005    0
## 15 0.5942503 0.6238526 0.6396691 0.6361073 0.6553971 0.6680258    0
#Search for the best number of trees (ntree) 
store_maxtrees <- list()
for (ntree in c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)) {
    set.seed(589)
    rf_maxtrees <- train(Drug~.,
        data = train_RF,
        method = "rf",
        metric = "Accuracy",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        maxnodes = 14,
        ntree = ntree)
    key <- toString(ntree)
    store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
summary(results_tree) #350 trees is taken as optimal (Accuracy ≈ .82, Kappa ≈ .64)
## 
## Call:
## summary.resamples(object = results_tree)
## 
## Models: 250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 250  0.7822991 0.8123093 0.8199390 0.8170726 0.8239633 0.8392675    0
## 300  0.7833164 0.8120549 0.8179044 0.8172760 0.8259977 0.8413021    0
## 350  0.7812818 0.8120549 0.8218834 0.8179881 0.8255341 0.8423194    0
## 400  0.7853510 0.8138352 0.8189217 0.8181917 0.8257439 0.8402848    0
## 450  0.7833164 0.8148525 0.8188310 0.8176828 0.8247711 0.8423194    0
## 500  0.7853510 0.8140895 0.8203580 0.8184968 0.8255341 0.8423194    0
## 550  0.7833164 0.8130722 0.8184130 0.8173780 0.8259984 0.8402848    0
## 600  0.7843337 0.8130722 0.8184130 0.8175815 0.8254012 0.8402848    0
## 800  0.7833164 0.8130722 0.8184130 0.8176832 0.8259984 0.8402848    0
## 1000 0.7873856 0.8123093 0.8194303 0.8181922 0.8276925 0.8392675    0
## 2000 0.7853510 0.8140895 0.8204476 0.8188024 0.8276917 0.8402848    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 250  0.5641069 0.6242844 0.6395625 0.6338813 0.6477354 0.6784070    0
## 300  0.5661735 0.6237479 0.6355795 0.6342905 0.6517592 0.6824778    0
## 350  0.5621000 0.6237952 0.6435073 0.6357150 0.6507783 0.6845060    0
## 400  0.5702274 0.6273098 0.6376146 0.6361171 0.6511815 0.6804351    0
## 450  0.5661339 0.6293674 0.6373466 0.6350938 0.6493484 0.6844916    0
## 500  0.5702274 0.6278507 0.6404072 0.6367304 0.6508300 0.6844916    0
## 550  0.5661339 0.6257716 0.6366091 0.6344911 0.6517170 0.6804351    0
## 600  0.5681610 0.6258272 0.6365837 0.6348902 0.6505406 0.6804205    0
## 800  0.5661735 0.6258187 0.6365025 0.6350983 0.6518359 0.6804205    0
## 1000 0.5743398 0.6242848 0.6386030 0.6361236 0.6551112 0.6783923    0
## 2000 0.5702470 0.6278422 0.6406467 0.6373410 0.6551141 0.6804205    0
#Generate the random forest
Model_RF = randomForest(Drug~., data=train_RF, ntree=350, mtry= 2, proximity=T)
table(predict(Model_RF), train_RF$Drug)
##            
##             Ayahuasca  LSD
##   Ayahuasca      4236  954
##   LSD             734 3905
Model_RF #  OOB estimate of error rate = 17.17% - pretty good!
## 
## Call:
##  randomForest(formula = Drug ~ ., data = train_RF, ntree = 350,      mtry = 2, proximity = T) 
##                Type of random forest: classification
##                      Number of trees: 350
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 17.17%
## Confusion matrix:
##           Ayahuasca  LSD class.error
## Ayahuasca      4236  734   0.1476861
## LSD             954 3905   0.1963367
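The OOB error rate reported above can be read straight off the confusion matrix: the off-diagonal counts are the misclassified tracks. A quick check using the counts copied from the output:

```r
#OOB error = misclassified (off-diagonal) counts / total, from the matrix above
conf <- matrix(c(4236,  734,
                  954, 3905), nrow = 2, byrow = TRUE,
               dimnames = list(c("Ayahuasca", "LSD"), c("Ayahuasca", "LSD")))
oob_error <- (conf["Ayahuasca", "LSD"] + conf["LSD", "Ayahuasca"]) / sum(conf)
round(100 * oob_error, 2)  #17.17
#The per-class class.error values are the same idea, row by row:
conf["Ayahuasca", "LSD"] / sum(conf["Ayahuasca", ])  #0.1476861
```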
plot(Model_RF) #This plots the OOB error rate (overall and per class) as trees are added - the model doesn't seem to improve much after 150 trees 

importance(Model_RF) #The most important variables (by MeanDecreaseGini) are acousticness, energy, loudness and speechiness
##                  MeanDecreaseGini
## acousticness            1209.5065
## danceability             365.3115
## energy                   840.8479
## instrumentalness         368.6797
## liveness                 267.1778
## loudness                 722.1561
## speechiness              447.8585
## tempo                    377.6408
## valence                  303.5572
varImpPlot(Model_RF) #Plot of the variables importance 
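MeanDecreaseGini in the table above sums, over all trees, how much each variable's splits reduce Gini impurity. A minimal sketch of one split's decrease (the class proportions and node sizes here are made up for illustration):

```r
#Gini impurity of a node with class proportions p: 1 - sum(p^2)
gini <- function(p) 1 - sum(p^2)
parent   <- gini(c(0.5, 0.5))  #0.5 - a maximally impure 50/50 node
#Size-weighted impurity of the two child nodes after a split (60% / 40% of rows)
children <- 0.6 * gini(c(0.9, 0.1)) + 0.4 * gini(c(0.2, 0.8))
decrease <- parent - children
decrease  #0.264 - this split's contribution to the variable's Gini decrease
```

randomForest accumulates these decreases for every split a variable is used in, across all trees, to produce the importance scores above.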

#Let's test how good our model is: 
Pred_RF = predict(Model_RF, newdata=test_RF)
table(Pred_RF, test_RF$Drug)
##            
## Pred_RF     Ayahuasca  LSD
##   Ayahuasca      1763  428
##   LSD             267 1713
confusionMatrix(Pred_RF, test_RF$Drug) #This reports all the evaluation metrics we need in one place!
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Ayahuasca  LSD
##   Ayahuasca      1763  428
##   LSD             267 1713
##                                           
##                Accuracy : 0.8334          
##                  95% CI : (0.8217, 0.8446)
##     No Information Rate : 0.5133          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6672          
##                                           
##  Mcnemar's Test P-Value : 1.286e-09       
##                                           
##             Sensitivity : 0.8685          
##             Specificity : 0.8001          
##          Pos Pred Value : 0.8047          
##          Neg Pred Value : 0.8652          
##              Prevalence : 0.4867          
##          Detection Rate : 0.4227          
##    Detection Prevalence : 0.5253          
##       Balanced Accuracy : 0.8343          
##                                           
##        'Positive' Class : Ayahuasca       
## 
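The headline numbers in the confusionMatrix() output can be reproduced by hand from the prediction table, with Ayahuasca as the positive class (counts copied from above):

```r
tp <- 1763; fn <- 267   #true Ayahuasca tracks: classified correctly / as LSD
fp <- 428;  tn <- 1713  #true LSD tracks: classified as Ayahuasca / correctly
accuracy    <- (tp + tn) / (tp + fn + fp + tn)  #0.8334
sensitivity <- tp / (tp + fn)                   #0.8685
specificity <- tn / (tn + fp)                   #0.8001
```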
#Plot the classification margins (a positive margin means the track was classified correctly)
plot(margin(Model_RF, test_RF$Drug)) # we have an upwards slope which looks good 
## Warning in RColorBrewer::brewer.pal(nlevs, "Set1"): minimal value for n is 3, returning requested palette with 3 different levels

#Calculating the AUC value and ROC curve - this gives us an indication of how good our model is 
#AUC Calculation 
library(pROC)
library(ROCR)
rf_p<- predict(Model_RF, type="prob")[,2]
rf_pr <- prediction(rf_p, train_RF$Drug)
r_auc <- performance(rf_pr, measure = "auc")@y.values[[1]] 
r_auc    #0.91 recall that this should be as close to 1 as possible, so .91 is a great outcome! 
## [1] 0.908265
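The AUC has a handy probabilistic reading: it is the probability that a randomly chosen track from the positive class receives a higher predicted score than a randomly chosen track from the negative class. A base-R sketch of this Mann-Whitney view, on made-up scores rather than model output:

```r
#AUC via the Mann-Whitney statistic: P(positive score > negative score), ties count 1/2
score_pos <- c(0.9, 0.8, 0.6)  #illustrative predicted probabilities, positive class
score_neg <- c(0.4, 0.7, 0.2)  #illustrative predicted probabilities, negative class
auc <- mean(outer(score_pos, score_neg, ">") +
            0.5 * outer(score_pos, score_neg, "=="))
auc  #8/9: 8 of the 9 positive/negative pairs are correctly ordered
```

So our model's 0.91 means that about 91% of Ayahuasca/LSD track pairs are ranked in the right order by the predicted probabilities.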

Conclusions: The random forest classifier, which can handle all the predictor variables, improved on the accuracy of the Binomial Logistic Regression model by around 3 percentage points. This model predicts at a rate of around 83%, which is 33 percentage points above chance (50%) and even more promising for our investigation. Further, the random forest model revealed that the variables most useful for prediction were in fact not the ones which showed the greatest variance in the data-visualisation boxplots, but included speechiness and loudness. This is a reminder of how machine-learning approaches such as classification can improve on exploratory data analysis and reveal trends not so readily seen by eye.